Techniques and Tools for Making Sense out of Heterogeneous Search Service Results

Author

  • Michelle Q Wang
Abstract

We describe a set of techniques that allows users to interact with results at a higher level than the citation level, even when those results come from a variety of heterogeneous on-line search services. We believe that interactive result analysis allows users to “make sense” out of the potentially many results that may match the constraints they have supplied to the search services. The inspiration for this approach comes from reference librarians, who do not respond to patrons’ questions with lists of citations, but rather give high-level answers that are tailored to the patrons’ needs. We outline here the details of the methods we employ in order to meet our goal of allowing for dynamic, user-directed abstraction over result sets, as well as the prototype tool (SenseMaker) we have built based upon these techniques. We also take a brief look at the more general theory that underlies the tool, and hypothesize that it is applicable to flexible duplicate detection as well.

1.0 Introduction

Imagine a reference librarian who responded to all queries from patrons by simply uttering one potentially relevant citation after another. Interacting with such a librarian would undoubtedly be a frustrating experience. In fact, it is hard to envision this scenario ever really taking place. Why not? The first and most obvious reason is that people are good at understanding what constitutes a reasonable answer to a question. In general, people have an intuitive understanding of when a high-level answer should be given in place of an abundance of low-level answers. The second reason, which may not be apparent to the naive patron, is that reference librarians have extensive training and experience in the art of the reference interview. Upon encountering someone with a broadly stated information need, a good reference librarian will engage the patron in a dialogue designed to discover more precisely the patron’s information needs.

Why analyze a scenario that sounds so patently absurd at first reading? Re-envision this scenario, if you will, with an on-line search service standing in for the librarian. Rather than sounding far-fetched, this scenario now describes the fashion in which most users interact with such search services today. Very few on-line search services respond to a query by presenting an abstraction over the matching citations — regardless of how many matching citations there may be. Furthermore, services that do provide users with abstracted views of result sets usually operate according to fixed rules and do not allow for much user control or interaction with the abstracted view.

An example of a search service that does incorporate a limited degree of abstraction into its interaction model is Stanford University’s Folio interface to its holdings database (Socrates).¹ When users query this service for works by a particular author, they are presented with a list of authority records rather than individual citations matching the user’s constraint. Selecting an authority record then reveals to the user the citations associated with that authority record.

1. See also [4] for details of a library system that does abstraction based on Library of Congress classification numbers, subject headings, and title keywords derived from MARC records.
While the displaying of authority records for author-based searches often proves useful, it is still the case in this system that users have little flexibility in determining what kind of abstraction will be presented to them, and they have limited possibilities for interaction with the presented abstraction.

In reflecting further upon our scenario, we can make another observation about the differences between a reference librarian’s presentation of results and a search service’s presentation of results (especially a search service that gathers its information from many heterogeneous sources). In general, a good librarian refrains from pointing a patron to sources that the patron would consider indistinguishable. For example, if a reference librarian were asked by a high school student for a good play by Shakespeare, the librarian would probably not direct the student both to the First Folio Edition of Twelfth Night and to the Second Folio Edition of Twelfth Night. It is even less likely that the librarian would direct the student to otherwise-identical hardback and paperback versions of Twelfth Night. The point here is that librarians are very good at judging whether or not two citations refer to the same work in the eyes of the patron, i.e., they are good duplicate detectors. Current search services, on the other hand, do not take into account the spectrum that duplicate citation detection can span. Many search services do not address the question of duplicate citation detection at all, while those search services that do tend to have static, hard-wired definitions for what conditions must be satisfied in order for two citations to be considered identical.

We present in this paper techniques and tools that can be used to address the two shortcomings of search services that we have described in this introduction: namely, their inability to abstract dynamically over search results in accordance with a user’s desires, and their inability to perform flexible duplicate detection. In fact, we will claim that the same techniques are applicable to both problems, though we will mainly concentrate here on the problem of abstraction. Furthermore, we argue that the reasons why these techniques work well for the tasks at hand can be found in our general theoretical model of searches for information artifacts.

We certainly do not mean to imply that our techniques make the experience of interacting with an automatic search service as rich as that of interacting with a well-trained reference librarian. What we do claim is that we can learn some very good lessons by observing how librarians and patrons interact — and that these lessons can inform our design of tools to improve interaction with automatic services. We also note here that the tool we describe in Section 3.0 is in fact designed with both sophisticated searchers and casual end-users in mind.

The rest of this paper is devoted to exploring how we can add interactive result analysis and abstraction to the Digital Library. We believe that the inclusion of these capabilities will make it easier for users to “make sense” out of the results that search services return to them. In Section 2.0 of this paper, we outline our approach to the abstraction problem. We have already taken this approach in our development of a prototype tool for the Stanford University Digital Library testbed, and we will describe its implementation and interface in Section 3.0 of the paper.
In Section 4.0, we take a closer look at the theoretical model of information artifact search that underlies this work. In Section 5.0, we look at the problem of duplicate detection as a kindred problem to the abstraction problem, and point out how our implementation will be extended to address duplicate detection as well. Finally, Section 6.0 offers a summary of our current approach and identifies the areas in which we will be undertaking further research.

2.0 Abstracting over Result Sets from Heterogeneous Sources

As we have observed, librarians are good at giving out high-level answers to patrons who ask a question for which there are many possible relevant citations. Furthermore, librarians may try to figure out which type of high-level answer is most suited to the patron by engaging in a reference interview. How can we incorporate some level of flexible, customized abstraction into the tools we create for existing on-line search services?

2.1 How can we approximate abstraction?

We begin by looking first at how we might perform simple abstraction over result sets at all. Currently, many on-line search services respond to queries with lists of citations, where citations are represented by attribute-value pairs (e.g., WebCrawler, a World Wide Web spider, responds to queries by providing values for rank, title, and URL attributes). The availability of these attribute-value pairs suggests a good method for operationalizing and approximating the process of performing abstraction over the citations. We can group together citations on the basis of their attribute values, and then describe the result set in terms of the citation groups rather than in terms of the citations themselves. For example, we might group together all citations that have the same title. The proposed approach (which could be augmented by other abstraction techniques as they become available) borrows both from SQL grouping facilities [1] and from statistical clustering techniques [12]. In SQL (and many other database languages), users can request that database rows (a database row is essentially a list of attribute-value pairs) be grouped together if they have identical values for a particular attribute (or attributes). Statistical clustering techniques, on the other hand, typically use more complex measures of similarity to cluster elements together. In addition, it is possible to have techniques that allow for an element to be assigned to more than one cluster (if certain conditions are satisfied).

2.2 How can we make abstraction flexible and customizable?

Now that we have seen how abstraction can be performed, we turn to looking at how a user might be able to control what kind of abstraction is performed. We have observed that in an SQL environment, grouping is performed with respect to a user’s specification of attribute names. For SQL grouping, this is all that must be specified, since grouping is performed only when rows have identical values for a particular attribute — no input is expected about similarity metrics. In contrast, statistical clustering algorithms can usually be parameterized in several ways (e.g., what is the similarity metric, what is the similarity threshold for clustering). We feel that expecting users to make low-level decisions about clustering algorithm details is not appropriate in the Digital Library.
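To make the grouping approach of Section 2.1 concrete, the sketch below approximates SQL-style grouping over citation records represented as attribute-value pairs. It assumes, purely for illustration, that results arrive as Python dictionaries; the titles and URLs shown are invented and not drawn from any particular service.

    from collections import defaultdict

    def group_by_attribute(citations, attribute, missing_label="(no value)"):
        """Group citation records (dicts of attribute-value pairs) by one attribute,
        in the spirit of SQL GROUP BY: citations fall into the same group only when
        their values for the chosen attribute are identical."""
        groups = defaultdict(list)
        for citation in citations:
            groups[citation.get(attribute, missing_label)].append(citation)
        return dict(groups)

    # Hypothetical results shaped like WebCrawler's rank/title/URL attribute-value pairs.
    results = [
        {"rank": 1, "title": "Twelfth Night", "url": "http://example.org/a"},
        {"rank": 2, "title": "Twelfth Night", "url": "http://example.org/b"},
        {"rank": 3, "title": "Hamlet",        "url": "http://example.org/c"},
    ]

    # Describe the result set in terms of citation groups rather than individual citations.
    for title, members in group_by_attribute(results, "title").items():
        print(title, "-", len(members), "matching citation(s)")

A similarity-based clustering routine could replace the exact-match grouping here; the point of the sketch is only that attribute-value pairs give us a handle for describing a result set at the level of groups.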
However, we do believe that it is important to grant a user control over what kind of abstraction is performed. The user must have control because he/she is interacting with an automatic service that lacks common sense. We cannot expect an automatic service to consistently determine the most appropriate abstraction for the situation, nor can we expect it to conduct a sophisticated reference interview.¹ We remedy these two conflicting desires — the wish to avoid burdening the user with low-level decisions and the wish to grant users the power to determine how abstraction is performed — by introducing an intermediate conceptual level into our design. The distinctions made at this intermediate level are high-level, but they are mapped “under the hood” to a particular set of parameters given to our abstraction module. The choice of what distinctions are available should depend on good design principles and on studies of what distinctions are valuable to users. In Section 3.0, we will detail the choices that we have made for our prototype tool. (A rough sketch of such a mapping appears below.)

In addition to giving users control over what kind of abstraction should be performed, we would also like to give users the ability to interact with the created abstractions. Important types of interaction include the ability to refine the automatic groupings that are presented (e.g., by collapsing, exploding, or editing the groupings), the ability to focus on particular groups, and the ability to request recursive abstraction for a focus group. The latter two possibilities for interaction have already been demonstrated in the work at Xerox PARC on Scatter-Gather ([1], [2]). Scatter-Gather, used in conjunction with full-text searches, operates by clustering together texts that its algorithm judges to be similar, then allowing the user to “gather” together some of those clusters for recursive clustering. Since the Scatter-Gather algorithm was developed with homogeneous full-text sources in mind, it does not address the question of how to handle interaction in the face of heterogeneous sources with multiple clustering possibilities over multiple attributes.

1. At the same time, we do expect that services will be able to make good “guesses” about what kind of abstraction is appropriate — and hence that users should always be supplied with a default abstraction choice.

2.3 How can we make abstraction work in a heterogeneous environment?

The strategies we have outlined in the previous two sections work straightforwardly in the case where we are accessing only one search service or a set of homogeneous search services (services that use the same set of attributes and the same set of conventions about how attribute values should be encoded and queried). However, when we look at how these ideas apply in a heterogeneous environment, the situation rapidly becomes much more complex. We illustrate the problem by looking at a scenario in which we would like to query both the Dialog Computer Database (containing journal and magazine articles from the domains of computers, telecommunications, and electronics) and the WebCrawler index of World Wide Web documents.
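Before examining that heterogeneity problem in detail, here is the rough sketch promised above of how high-level distinctions might map to abstraction parameters. The view names, attributes, and threshold below are invented for illustration; the distinctions SenseMaker actually offers are described in Section 3.0.

    # Each high-level distinction offered to the user is mapped "under the hood" to a
    # bundle of low-level parameters for the abstraction module. All of the view names,
    # attributes, and thresholds below are hypothetical.
    ABSTRACTION_VIEWS = {
        "group by work":   {"attribute": "title",   "match": "exact"},
        "group by author": {"attribute": "author",  "match": "exact"},
        "group by topic":  {"attribute": "subject", "match": "similarity", "threshold": 0.7},
    }

    # The service's "good guess", so the user is always supplied with a default choice.
    DEFAULT_VIEW = "group by work"

    def abstraction_parameters(view_name=None):
        """Translate a user's high-level choice into abstraction-module parameters,
        falling back to the default when no choice (or an unknown one) is given."""
        return ABSTRACTION_VIEWS.get(view_name, ABSTRACTION_VIEWS[DEFAULT_VIEW])

    params = abstraction_parameters("group by author")
    # params == {"attribute": "author", "match": "exact"}; it can now drive either
    # SQL-style grouping or a similarity-based clustering algorithm.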
Even assuming that we normalize attribute names (in fact, WebCrawler does not actually give an attribute name for its title values, even though it is clear that they are titles), we are still faced with the fact that certain attributes make sense for World Wide Web documents but not for magazine articles, and vice versa. For example, World Wide Web documents, but not magazine articles, have URLs. Further complicating the matter is that some attribute values are not readily available even when the attribute does make sense. We intuitively feel that both World Wide Web documents and magazine articles should have an “author” attribute, but it is rare that we are able to determine automatically the author of a World Wide Web document. There are not yet standard conventions for how World Wide Web document authors should be declared, whereas we do have such conventions in the traditional publishing world.

In essence, abstraction over the results from individual heterogeneous sources can only work if we have a canonical set of attribute-value pairs to which we can appeal. In order to obtain this canonical set, we must have methods for translating from the individual source attribute-value pairs to the canonical versions. This problem can be viewed as the inverse of the attribute translation component of query translation (in which queries written in a canonical front-end language are translated into queries that are native to individual heterogeneous sources). Two “forward” attribute translation approaches that have influenced our solution to the inverse attribute translation problem are those taken by the Xerox PARC GAIA effort ([8], [9]) and by Paepcke [7]. In GAIA, users may either make reference to universal attributes that are fixed for the front-end query language, or they may make reference to attributes that are native to a particular source. The query translator is responsible for translating universal attributes to their native equivalents. In contrast, the approach described by Paepcke moves away from the dichotomy of universal and specific attribute sets. Paepcke models sources as types, organizes the types into an inheritance hierarchy, and introduces a typed query language for the front end. This approach allows users to make reference to attributes that are neither universal nor native, but rather are common to a group of related sources.

As we pointed out earlier, abstraction over results from heterogeneous sources requires a shared attribute set. In the case where every query is sent to a fixed collection of source services, a well-designed universal attribute set (as in GAIA) is appropriate for abstraction purposes. If, however, the user is granted control over which search services should be consulted for a given query, then this solution has some important drawbacks. For example, a source collection might include WebCrawler, Lycos (another WWW spider), and the Dialog Computer Database. URL-based abstraction makes sense if the user has selected only Web-based services, but is not as useful if the Dialog Computer Database has also been selected. Since one of our design goals is to increase the degree of possible user customization, the approach we adopt is more similar to that described by Paepcke. We introduce the concept of a type hierarchy into the design, where each node in the hierarchy corresponds to a particular information object type (e.g., World Wide Web document). Furthermore, each node has associated with it at least one canonical set of attributes.
For each individual search service, we identify the service with at least one particular node in the tree, and define mappings for the service that can translate its native attributes and attribute values into the canonical attribute descriptions associated with its corresponding node (as well as into the canonical attribute descriptions associated with the ancestors of that node). Adding a new source (or information object type) is easy to do with this design. A simple version of an information object type hierarchy is presented in Figure 1.

FIGURE 1. Information Object Type Hierarchy (nodes: Information Object, Document, Published Document, Web Document)
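To illustrate this design, the sketch below encodes a toy type hierarchy with a canonical attribute set at each node and per-service mappings from native attribute names to canonical ones. The hierarchy follows one plausible reading of Figure 1; the attribute sets and the native attribute names for WebCrawler and the Dialog Computer Database are hypothetical, not taken from the actual services or from SenseMaker.

    # A toy rendering of the hierarchy in Figure 1, with a canonical attribute set at
    # each node; a node also inherits the attributes of its ancestors.
    TYPE_HIERARCHY = {
        "Information Object": {"parent": None,                 "attributes": {"title"}},
        "Document":           {"parent": "Information Object", "attributes": {"author"}},
        "Published Document": {"parent": "Document",           "attributes": {"journal", "date"}},
        "Web Document":       {"parent": "Document",           "attributes": {"url"}},
    }

    # Hypothetical per-service mappings from native attribute names to canonical ones.
    SERVICE_MAPPINGS = {
        "WebCrawler": {"node": "Web Document",
                       "attributes": {"Title": "title", "URL": "url"}},
        "Dialog Computer Database": {"node": "Published Document",
                                     "attributes": {"TI": "title", "AU": "author", "JN": "journal"}},
    }

    def canonical_attributes(node):
        """Collect the canonical attributes for a node and all of its ancestors."""
        attributes = set()
        while node is not None:
            attributes |= TYPE_HIERARCHY[node]["attributes"]
            node = TYPE_HIERARCHY[node]["parent"]
        return attributes

    def to_canonical(service, native_record):
        """Translate a native result record into canonical attribute-value pairs for
        the service's node (and that node's ancestors)."""
        mapping = SERVICE_MAPPINGS[service]
        allowed = canonical_attributes(mapping["node"])
        canonical = {}
        for native_name, value in native_record.items():
            canonical_name = mapping["attributes"].get(native_name)
            if canonical_name in allowed:
                canonical[canonical_name] = value
        return canonical

    record = to_canonical("WebCrawler", {"Title": "Twelfth Night", "URL": "http://example.org/a"})
    # record == {"title": "Twelfth Night", "url": "http://example.org/a"}, ready to be
    # grouped alongside canonical records translated from other services.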


Related Articles

Designing and Presenting of a Model of Sense Making in Service Organizations

Organizational sense making is the process that helps managers to understand how organization's members change ideas and eventually what they chose, maintain and achieve among different meanings. This study examines and presents a model of sense making in service organizations and results in addition to the research community in other service organizations such as municipalities and municipal ...


Feuerstein's Theory of Mediation and Its Impact on EFL Teachers’ Sense of Efficacy

Earlier self-efficacy studies have been blamed for their methodological weakness and their mere reliance on self-report, survey, and correlational techniques for data collection. The purpose of this study, therefore, was to assess the impact of Feuerstein’s theory of mediation on EFL teachers’ sense of efficacy through direct observation rather than self reports and to use experimental techniqu...


Weighted-HR: An Improved Hierarchical Grid Resource Discovery

Grid computing environments include heterogeneous resources shared by a large number of computers to handle the data and process intensive applications. In these environments, the required resources must be accessible for Grid applications on demand, which makes resource discovery a critical service. In recent years, various techniques have been proposed to index and discover the Grid resource...


Design of homogeneous and heterogeneous human equivalent thorax phantom for tissue inhomogeneity dose correction using TLD and TPS measurements

Background: The purpose of this study is to fabricate an inexpensive, in-house homogeneous and heterogeneous human-equivalent thorax phantom and assess the dose accuracy of the Treatment Planning System (TPS) calculated values for different lung treatment dosimetry, compared with Thermoluminescent Dosimeter (TLD) measurements. Materials and Methods: Homogeneous and heterogeneous tho...


Fuzzy retrieval of encrypted data by multi-purpose data-structures

The growing amount of information that has arisen from emerging technologies has caused organizations to face challenges in maintaining and managing their information. Expanding hardware and human resources, or outsourcing data management and maintenance to an external organization in the form of cloud storage services, are two common approaches to overcoming these challenges; the first approach costs of...



Journal:

Volume:   Issue:

Pages:  -

Publication date: 1998